D3EGFR: a webserver for deep learning-guided drug sensitivity prediction and drug response information retrieval for EGFR mutation-driven lung cancer

Abstract
As key oncogenic drivers in non-small-cell lung cancer (NSCLC), mutations in the epidermal growth factor receptor (EGFR) show variable drug sensitivities and have been a major obstacle for precision medicine. To achieve clinical-level drug recommendations, a platform for clinical patient case retrieval and reliable drug sensitivity prediction is urgently needed. Therefore, we built a database, D3EGFRdb, with the clinicopathologic characteristics and drug responses of 1339 patients with EGFR mutations via literature mining. On the basis of D3EGFRdb, we developed a deep learning-based prediction model, D3EGFRAI, for drug sensitivity prediction of new EGFR mutation-driven NSCLC. Model validations of D3EGFRAI showed prediction accuracies of 0.81 and 0.85 for patients from D3EGFRdb and our hospitals, respectively. Furthermore, mutation scanning of the crucial residues inside drug-binding pockets was performed to explore the drug sensitivity changes of mutations that may occur in the future. D3EGFR is the first platform to achieve clinical-level drug response prediction of all approved small molecule drugs for EGFR mutation-driven lung cancer and is freely accessible at https://www.d3pharma.com/D3EGFR/index.php.


INTRODUCTION
Lung cancer is the most common malignant disease and the leading cause of cancer mortality worldwide, causing approximately 2.2 million new cases and 1.8 million deaths in 2020 [1]. Non-small-cell lung cancer (NSCLC) accounts for 85% of all lung malignancies [2,3], mainly comprising adenocarcinoma (ADC), squamous cell carcinoma and large cell carcinoma. Epidermal growth factor receptor (EGFR) mutations are closely associated with carcinogenesis [4], and have been identified in approximately 32.3% of NSCLC [5]. Mutations in the kinase domain of EGFR can promote ligand-independent dimerization and activation of the receptor, resulting in constitutive activation of downstream signaling pathways to induce tumorigenesis [6,7].
EGFR-tyrosine kinase inhibitors (EGFR-TKIs) are used as the standard treatment for patients with advanced EGFR mutation-driven lung cancer [8,9]. In patients with EGFR-sensitive mutations, compared with platinum-based chemotherapy, EGFR-TKIs significantly improve the objective response rate and prolong progression-free survival (PFS) and overall survival (OS) [10][11][12]. However, patients with different EGFR mutations exhibit varying responses to EGFR-TKIs, mainly because of either intrinsic or acquired resistance [13]. The development of DNA sequencing technologies has enabled the identification of several novel and uncharacterized EGFR variants [14], which makes precision medicine more challenging for patients with new mutations [15,16].
To date, only nine small-molecule drugs have been approved for the treatment of patients with metastatic EGFR mutation-positive NSCLC worldwide (Table S1). The first-generation EGFR-TKIs, gefitinib and erlotinib, have been used as the first-line therapy for patients with common EGFR mutations such as exon 19 deletion (19del) or L858R point mutation [17,18], and the third-generation agent, osimertinib, could benefit patients with the T790M resistant mutation [19,20]. However, the efficacy of these EGFR-TKIs in the treatment of patients with uncommon or new EGFR mutations remains inadequately elucidated.
Thanks to the cumulative experience of relevant clinical trials over the past two decades, the risks of adverse effects and poor therapeutic efficacy in patients with common mutations can be kept low throughout the entire treatment period. Notably, the individual characteristics of patients, including gender, age and smoking status, are also related to the incidence rate of EGFR mutation-driven lung cancer [5,21]. Although considerable progress has been made in integrating information on EGFR mutants and targeted drugs [22][23][24][25][26][27], systematic retrospective clinical analysis has been limited by the absence of credible resources profiling the clinical characteristics and outcomes of patients with EGFR mutations. Thus, a comprehensive and searchable database with details of patient cases, including EGFR mutation, clinicopathological characteristics and therapeutic response to approved drugs, is highly needed for making treatment decisions.
In addition, for rare or newly emerged variants, EGFR mutation status has been explored as a predictive and prognostic marker of the effect of targeted therapy [28][29][30]. For instance, Ikemura et al. [31] successfully predicted the diverse in vitro and in vivo sensitivities of exon 20 insertion mutants using molecular dynamics (MD) simulations, in which the binding free energy (ΔG_bind) value for a given mutant-inhibitor complex can be obtained in approximately 1 week. Wang et al. [32] combined MD and extreme learning machines to construct a personalized drug resistance prediction model. However, the long run times and high computational costs of MD simulations obstruct their wide application. In addition, the previous studies only predict drug response for two or fewer drugs (Table S2). Recently, artificial intelligence has shown increased expressive power in identifying, processing and extrapolating drug-target interactions based on existing biological activity data [33,34], which could make it an effective tool for developing a fast and accurate drug sensitivity prediction model for rare and newly emerged mutations.
In this study, we aimed to investigate the impact of EGFR mutations on drug sensitivity and provide optimal treatment guidance through a real patient case database and a drug sensitivity prediction tool. First, the overall information on the D3EGFRdb clinical patient database was introduced, including the number of literature sources and cases, the distribution of individual patient characteristics and the analysis of statistical results. Second, the feasibility of molecular docking and deep learning approaches in predicting drug sensitivity was evaluated and the selected deep learning model was used to explore potential changes in drug sensitivity caused by amino acid mutations around the drug-binding pocket of EGFR. Finally, the construction and usage of the D3EGFR website were introduced to assist users in effectively using the D3EGFRdb patient database and D3EGFRAI prediction model.

Construction of a clinical medication database for patients with EGFR mutations
A literature search was performed in PubMed [35] for relevant studies published before 16 February 2023. The specific search strategy was as follows: (i) the title or abstract must contain the keywords 'EGFR mutation' and 'non-small cell lung cancer', (ii) the title or abstract must contain at least one EGFR-TKI keyword, namely 'tyrosine kinase inhibitors', 'gefitinib', 'erlotinib', 'icotinib', 'afatinib', 'osimertinib', 'olmutinib', 'dacomitinib', 'almonertinib' or 'furmonertinib' and (iii) the full text must contain keywords about drug responses. Drug response is evaluated based on the World Health Organization criteria [36] and Response Evaluation Criteria in Solid Tumours (RECIST) V1.0 or V1.1 guidelines [37,38], which divide responses into complete response (CR), partial response (PR), stable disease (SD) and progressive disease (PD). Therefore, the full text must contain at least one of the following four keywords: 'complete response', 'partial response', 'stable disease' and 'progressive disease'.
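The three inclusion criteria above can be expressed as a simple keyword screen. The following is a minimal sketch only, not the authors' actual pipeline: the function name `passes_screen`, the helper `_normalize` and the choice to treat hyphens as spaces are assumptions made for illustration.

```python
import re

# Drug keywords listed in criterion (ii) of the search strategy.
DRUG_TERMS = [
    "tyrosine kinase inhibitors", "gefitinib", "erlotinib", "icotinib",
    "afatinib", "osimertinib", "olmutinib", "dacomitinib",
    "almonertinib", "furmonertinib",
]
# Response keywords required in the full text by criterion (iii).
RESPONSE_TERMS = [
    "complete response", "partial response",
    "stable disease", "progressive disease",
]


def _normalize(text: str) -> str:
    """Lower-case and treat hyphens as spaces (assumption), so that
    'non-small-cell' and 'non-small cell' both match."""
    return re.sub(r"[-\s]+", " ", text.lower())


def passes_screen(title_abstract: str, full_text: str) -> bool:
    """Return True if a record satisfies criteria (i)-(iii)."""
    ta = _normalize(title_abstract)
    ft = _normalize(full_text)
    has_topic = "egfr mutation" in ta and "non small cell lung cancer" in ta
    has_drug = any(term in ta for term in DRUG_TERMS)
    has_response = any(term in ft for term in RESPONSE_TERMS)
    return has_topic and has_drug and has_response
```

A screen of this kind only narrows the candidate set; the patient-level data described later were still collated manually.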

Prediction of the drug sensitivity of EGFR mutants based on molecular docking
Molecular docking is a structure-based strategy for predicting potential binding between a drug and a protein [39]. Using this strategy, various docking models were constructed for different mutated EGFRs. The correlation between the docking score and bioactivity was then calculated to analyse the feasibility of drug sensitivity prediction of EGFR mutants. The bioactivity dataset used in this study is from the report by Robichaux et al. [40], covering 1349 experimentally measured biological activities (log(mutant/wild type) of IC50 values) of 18 EGFR-TKIs and 77 EGFR mutants. After excluding mutants for which only the mutant exon is reported, such as 19del, we performed homology modeling with MODELLER (version 9.24) [41] to construct 3D structures for 64 mutants with clear mutation sites, using the X-ray structure (PDB id: 3POZ, resolution: 1.50 Å) as the template. The generated mutant protein structures were protonated at pH 7.4 using pdb2pqr software [42]. Molecular docking was performed using smina [43], which is a fork of AutoDock Vina [44] with improved docking performance. The docking boxes of all mutants were generated by extending 4 Å in each dimension based on the coordinates of the reference ligand in the crystal complex. Docking was performed using random seed 0.
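The docking box construction can be illustrated with a short sketch. It assumes that "extending 4 Å in each dimension" means padding the ligand's bounding box by 4 Å on both sides of every axis; the function name `docking_box` is hypothetical, and the source does not state how the box was computed in practice.

```python
from typing import Iterable, Tuple

Coord = Tuple[float, float, float]


def docking_box(ligand_coords: Iterable[Coord], padding: float = 4.0):
    """Return (center, size) of an axis-aligned box around the ligand.

    Assumption: the padding is added on both sides of each axis, so every
    edge is the ligand extent plus 2 * padding.
    """
    coords = list(ligand_coords)
    if not coords:
        raise ValueError("ligand has no atoms")
    mins = [min(c[i] for c in coords) for i in range(3)]
    maxs = [max(c[i] for c in coords) for i in range(3)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    size = tuple((hi - lo) + 2 * padding for lo, hi in zip(mins, maxs))
    return center, size
```

Because all 64 mutant models were built from the same template, a single box derived from the reference ligand can plausibly be reused across mutants, which keeps the docking scores comparable.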

Deep learning model for predicting the drug sensitivity of EGFR mutations
Given that deep learning can detect features in large-scale bioactivity data and offers flexible neural network architectures, it has achieved remarkable success in the prediction of drug-target interactions [45]. In addition, deep learning can be independent of the 3D structures of proteins, thereby avoiding biases caused by structural modeling. In this study, we explored deep learning models with different encoder combinations for drugs and protein mutants to identify the optimal model for drug sensitivity prediction. The drug and protein encoders were provided by DeepPurpose [46], with 80 encoder combinations (Table 1). Regarding datasets, 1/10 of the 1349 experimentally determined biological activity data were taken as the test set and the remaining data were further randomly divided into 10 different training and validation sets at a ratio of 9:1 for 10-fold cross-validation. Model performance was evaluated using the Pearson correlation coefficient (R) and the mean squared error (MSE):

$$R = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}} \tag{1}$$

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2 \tag{2}$$

where N represents the number of samples, x_i and y_i represent the labels and predicted values of the samples, respectively, and \bar{x} and \bar{y} are their means. Then, the models were retrained on the merged training and validation sets and evaluated on the test set. Subsequently, we predicted the binding affinity of mutations collected in D3EGFRdb and mapped the predicted value to the drug response using a multinomial logistic regression analysis, which was taken from the sklearn machine learning library. For the multinomial logistic regression analysis, we used the 'newton-cg' solver, the 'l2' penalty and C = 1.0, as well as the balanced mode to automatically adjust weights inversely proportional to class frequencies in the input data. Figure 1 illustrates the framework of the drug sensitivity prediction model D3EGFRAI.
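The mapping from predicted affinity to response class can be sketched with scikit-learn using the settings stated above. This is an illustrative sketch, not the trained D3EGFRAI model: the scores and labels are toy values invented for the example, and using a single input feature (the predicted affinity) is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): predicted affinity scores and response labels.
scores = np.array(
    [-1.2, -0.9, -0.8, -0.1, 0.0, 0.2, 0.9, 1.1, 1.4]
).reshape(-1, 1)
labels = np.array(
    ["CR/PR", "CR/PR", "CR/PR", "SD", "SD", "SD", "PD", "PD", "PD"]
)

# Settings from the text: 'newton-cg' solver, C = 1.0, balanced class
# weights. The penalty is left at its default, which is L2.
clf = LogisticRegression(solver="newton-cg", C=1.0, class_weight="balanced")
clf.fit(scores, labels)

# Class probabilities for a new predicted score; the columns follow
# clf.classes_, which scikit-learn stores in sorted order.
proba = clf.predict_proba(np.array([[0.1]]))
print(dict(zip(clf.classes_, proba[0].round(3))))
```

The balanced mode matters here because D3EGFRdb contains far more responders than non-responders, so unweighted fitting would tend to favour the majority class.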

Average clinical drug response (ACR) for the quantitative representation of drug response
Because of the influence of individual differences and other complex factors, patients with the same mutation type and the same drug administration may have different drug responses. For instance, in D3EGFRdb, five patients with the D770insSVD mutation were treated with erlotinib, among whom three showed a PD response and the other two showed an SD response, suggesting that directly using individual labels for model evaluation is unreasonable. Thus, we defined an ACR to represent the overall efficacy of patients with the same mutation type and drug treatment. Drug responses were converted to numerical values such that CR/PR = −1, SD = 0 and PD = 1. Then, drug-mutant pairs with a total number of patient cases greater than 3 in D3EGFRdb were screened and their average clinical response value (ACRV) was calculated using equation 3 and was further converted to ACR using equation 4. Thus, we constructed a representative D3EGFRdb subset with 43 drug-mutant pairs for model evaluation.

$$\mathrm{ACRV} = \frac{(-1) \cdot N_{\mathrm{CR/PR}} + 0 \cdot N_{\mathrm{SD}} + 1 \cdot N_{\mathrm{PD}}}{N_{\mathrm{CR/PR}} + N_{\mathrm{SD}} + N_{\mathrm{PD}}} \tag{3}$$

where N_{CR/PR}, N_{SD} and N_{PD} are the numbers of CR/PR, SD and PD patients with the same mutation type and drug treatment, respectively.
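The ACRV is the mean of the numerical response values over all patients sharing a drug-mutant pair. A minimal sketch follows, using the D770insSVD and erlotinib example from the text; the function name `acrv` is hypothetical, and the subsequent conversion of ACRV to ACR is not implemented because its thresholds are not given here.

```python
# Numerical encoding of drug responses as defined in the text.
RESPONSE_VALUE = {"CR/PR": -1, "SD": 0, "PD": 1}


def acrv(responses):
    """Average clinical response value for one drug-mutant pair."""
    if not responses:
        raise ValueError("at least one patient response is required")
    return sum(RESPONSE_VALUE[r] for r in responses) / len(responses)


# Five D770insSVD patients treated with erlotinib: three PD, two SD.
d770_erlotinib = ["PD", "PD", "PD", "SD", "SD"]
print(acrv(d770_erlotinib))
```

Values close to −1 indicate that most patients responded, and values close to 1 indicate predominantly progressive disease.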

D3EGFRdb overview and statistical analysis
Through systematic literature search and manual collation, 141 studies on the clinical medication and drug responses of patients with EGFR mutations were identified, of which 108 were retrospective case reports/series, 26 were prospective clinical trials and 7 were prospective cohort studies. All patients with EGFR mutations were collected and annotated with clinical information (such as mutation site, gender, age, smoking status, pathology and EGFR-TKI treatment), clinical outcomes (such as drug response, time to progression, PFS and OS), study type and original literature for convenience. Based on this information, we constructed a clinical medication database D3EGFRdb for patients with EGFR mutations. D3EGFRdb contained a total of 1339 patients with 257 different mutation types, including 1032 patients in the response group (CR/PR/SD) and 307 patients in the non-response group (PD).
The reported mutation sites were mainly located in exons 18-21 (Figure 2A), which encode the tyrosine kinase domain of the EGFR gene and are the binding sites of available drugs. For instance, exon 19 deletion and exon 21 L858R are the most common EGFR mutations in these regions, whereas less common mutations include G719X and E709X in exon 18, S768I and T790M in exon 20 and L861Q and K860I in exon 21. In terms of the clinical application of EGFR-TKIs, the first-generation inhibitor gefitinib from AstraZeneca is the most extensively used and widely studied EGFR-TKI (951 cases, 71.0%), followed by another first-generation inhibitor, erlotinib (256 cases, 19.1%). Gefitinib showed a slightly better clinical drug response than erlotinib (gefitinib, CR/PR versus SD/PD: 51.2% versus 48.8%; erlotinib, CR/PR versus SD/PD: 44.1% versus 55.9%) (Figure 2B). The relatively low use of the second-generation inhibitors afatinib and dacomitinib is associated with increased toxicity through non-specific targeting of wild-type EGFR [47,48]. The third-generation EGFR-TKI osimertinib is the first FDA- and EMA-approved EGFR-TKI for treating patients with metastatic NSCLC who have a T790M resistance mutation [49]. In addition, icotinib is a potent and specific EGFR-TKI that was approved in China in 2011 [50]. The above information, together with gender, age, smoking status, pathology, time to progression, PFS, OS, study type and original literature, was collected in D3EGFRdb, making it a comprehensive database for retrospective medical record searches.
Multivariate analysis of D3EGFRdb (Figure 3) revealed that females (females versus males: 47.8% versus 31.6%), individuals aged 60-79 years (34.1%) and non-smokers (non-smoker versus smoker: 39.1% versus 23.8%) were the most prevalent groups among patients with EGFR mutations. This suggests that individual characteristics of patients are associated with the incidence of EGFR-mutant lung cancer, which is consistent with the findings of Zhang and Shigematsu [5,21]. In addition, the predominant pathology was adenocarcinoma (ADC versus non-ADC: 68.1% versus 7.9%) in the reported patient series. Furthermore, point mutation is the most common mutation type (48.6%), followed by deletion mutation (16.3%), mainly comprising the common L858R substitution in exon 21 and deletion mutations in exon 19.

External clinical dataset for assessment
To validate the D3EGFRAI prediction model, the clinical information and outcomes of 102 patients treated with EGFR-TKIs in the Shanghai Pulmonary Hospital between March 2015 and October 2020 were used as the external clinical dataset (Table 2 and Table S3). The Ethics Committee of Shanghai Pulmonary Hospital approved this study and informed consent was waived because this was a retrospective study. Patient ages ranged from 33 to 85 years, with a median age of 61 years, and the histology was mostly adenocarcinoma. Objective responses to EGFR-TKIs were evaluated according to the RECIST V1.1 guidelines [38]. There were 13 different types of drug-mutant information pairs (drug-mutant pair hereinafter) in these 102 patient cases and their average clinical drug response (ACR) was re-evaluated.

No correlation was observed between molecular docking and the drug response
Drug sensitivity prediction with molecular docking focuses on somatic mutations in exons 18-21 of the EGFR tyrosine kinase domain and is based on the hypothesis that the docking score can serve as a metric for drug sensitivity. We calculated the docking scores for six approved drugs against 64 mutants and calculated their correlations with the experimental values. However, the calculated docking scores were not correlated with the experimental values (maximal correlation R² = 0.143; Figure S1), indicating that molecular docking may be an unreliable method for drug sensitivity prediction. The poor results may be due to the low accuracy of homology modeling, which cannot accurately reflect the protein structural changes caused by the residue mutation.
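The correlation check between docking scores and measured activities can be reproduced with a few lines of code. The sketch below assumes that R² denotes the squared Pearson correlation coefficient, which the source does not state explicitly; the function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def r_squared(x, y):
    """Squared Pearson correlation (assumed meaning of R^2 here)."""
    return pearson_r(x, y) ** 2
```

In this setting `x` would hold the docking scores of one drug across the mutants and `y` the corresponding log(mutant/wild type) IC50 values.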

Deep learning models with high prediction accuracy
The correlations between the scores predicted by 80 deep learning models and the experimental values were calculated. Seventeen models showed an average correlation >0.8, demonstrating the effectiveness of deep learning models in predicting the binding affinity of protein mutants and EGFR-TKIs (Figure 4A and B). For these 17 models, we merged the training and validation sets for retraining and re-evaluated their correlation on the test set. The results showed that 14 models had a correlation of >0.8 on the test set (Figure 4C). Furthermore, a multinomial logistic regression model was applied to map the predicted value to the drug response based on the representative D3EGFRdb subset. Finally, the Morgan + CNN model had the best performance, with correlation coefficients of 0.81 in the biological activity validation dataset and 0.86 in the biological activity test dataset, and its prediction accuracy in the representative D3EGFRdb subset was 0.81 (Table S4). Therefore, it was used as the final model for D3EGFRAI.
As mentioned above, there may be one or two main drug responses with a higher probability in the same drug-mutant pair. Figure 5 shows the predicted probability of each drug response for drug-mutant pairs in the representative D3EGFRdb subset calculated by the predict_proba function of the logistic regression model. For example, the predicted probabilities of CR/PR, SD and PD of Afatinib-A767dupACS were 47.0%, 47.75% and 5.3%, respectively, indicating that CR/PR and SD are both likely drug responses for this drug-mutant pair. Taking only the drug response with the highest probability as the output therefore cannot provide comprehensive information from a computational perspective. For this reason, the prediction results displayed by D3EGFRAI include both the most likely drug response and the associated probabilities of each drug response. When the top two most likely drug responses predicted by D3EGFRAI were considered, the prediction accuracy improved from 0.81 to 0.95 for the representative D3EGFRdb subset. Finally, the D3EGFRAI model was applied to the external clinical dataset, in which the accuracy based on the drug response with the highest probability was 0.85 and that based on the top two drug responses with the highest probability was 0.92 (Table 3). In the external clinical dataset, 61.5% of drug-mutant pairs are not in the D3EGFRdb database, indicating that D3EGFRAI successfully maps binding affinity scores to drug response categories and generalizes well to unseen pairs.
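The difference between top-1 and top-2 accuracy can be made concrete with a short sketch. The probabilities for Afatinib-A767dupACS are taken from the text; the observed label used in the example is hypothetical, and the function names are illustrative rather than part of D3EGFRAI.

```python
def top_k_labels(probabilities, k):
    """Return the k response labels with the highest predicted probability."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return ranked[:k]


def top_k_accuracy(predictions, observed, k):
    """Fraction of pairs whose observed response is among the top-k labels."""
    if len(predictions) != len(observed) or not observed:
        raise ValueError("need equally sized, non-empty inputs")
    hits = sum(
        1
        for probs, label in zip(predictions, observed)
        if label in top_k_labels(probs, k)
    )
    return hits / len(observed)


# Predicted probabilities reported for Afatinib-A767dupACS.
afatinib_a767 = {"CR/PR": 0.470, "SD": 0.4775, "PD": 0.053}
# Hypothetical observed response, used only to illustrate the metric.
observed = ["CR/PR"]

print(top_k_accuracy([afatinib_a767], observed, k=1))
print(top_k_accuracy([afatinib_a767], observed, k=2))
```

When two classes have nearly equal probability, as in this pair, a top-1 metric penalizes a prediction that is essentially a tie, which is why reporting the full probability vector is more informative.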

D3EGFR input and output
For convenience, the D3EGFR server was constructed by integrating D3EGFRdb and D3EGFRAI for users to retrieve the collected drug response information and to predict drug response for rare and new mutations. Users can combine these two methods to determine the optimal drug treatment. The webserver supports English and Chinese (Simplified). D3EGFR is free for all users and no login is required. Figure 6 shows the brief interfaces of the D3EGFR input and output. Notably, the Ministry of Food and Drug Safety of South Korea has prohibited doctors from prescribing olmutinib to new patients. Therefore, we removed olmutinib from the approved drug list in the prediction. The drug response retrieval in D3EGFRdb provides the statistical results of the drug response ratios of the mutants and drugs, as well as the specific clinical characteristics and original literature of each patient case. Taking the mutation T790M + L858R as an example, there were 29 patient cases in D3EGFRdb, in which the CR/PR response rate of osimertinib was 78.5%, superior to gefitinib (0%), erlotinib (0%) and afatinib (14.3%), indicating that osimertinib is an effective drug for treating patients with the T790M + L858R mutation. In addition, the predicted result of D3EGFRAI shows that the T790M + L858R mutation is sensitive to osimertinib and resistant to gefitinib, erlotinib and afatinib, consistent with D3EGFRdb results and previous reports [53]. In D3EGFRAI, users can obtain prediction results within 10 s by submitting a new mutation type.

DISCUSSION
Drug sensitivity changes caused by protein mutations have seriously affected the therapeutic benefits of targeted drugs. There are hundreds of clinically reported EGFR mutations that have inconsistent drug responses primarily because of mutation-induced changes in protein-drug-binding affinity. At the atomic level, mutated residues may increase steric hindrance effects or influence interactions between the protein and the ligand, thereby causing changes in the drug's binding ability and affecting the effectiveness of drug treatment.
As described in previous studies [54][55][56], there is increasing evidence that EGFR mutants can be used as predictive biomarkers for drug response in NSCLC. Therefore, this study selected EGFR mutation and drug response as variables to build a prediction model. Unlike previous studies [31,32][57][58][59][60], we manually collected large-scale patient cases from the literature over the past two decades to perform data-driven drug response prediction for all approved EGFR-TKIs. However, among the collected patient cases, patients with the same mutation sometimes responded differently to the same drug treatment. This may be related to the patients' individual characteristics and other factors, but the specific reasons remain unclear. Therefore, we are collecting more data to introduce more variables in future model construction to enhance the model's prediction performance.

CONCLUSION
In this study, we developed the D3EGFR platform as a clinical-level drug recommendation tool to promote the development of precision medicine. Specifically, D3EGFRdb can provide real patient cases with specific clinical information and medication outcomes for convenient query, whereas D3EGFRAI is a drug response prediction model that has satisfactory prediction performance in clinical patient cases. Both methods will be useful in future clinical applications and scientific research. Based on real patient cases and prediction results provided by D3EGFR, clinicians can further combine their clinical experience and medical tests to decide on a more reasonable method of medication. More reported and internal clinical trial results in the future will be helpful to further improve the prediction accuracy and reliability of D3EGFR.

Key Points
• D3EGFR can efficiently retrieve drug responses based on D3EGFRdb.
• D3EGFR can make reliable drug response predictions based on D3EGFRAI.
• Mutation scanning of crucial residues for approved drugs was performed.

Figure 1. Frameworks of prediction models using different encoder combinations for drugs and protein mutants.

Figure 2. Mutation status and clinical outcomes of patients in D3EGFRdb. (A) Distribution of hotspot mutations. (B) Distribution of patient cases with different drug responses to EGFR-TKIs. A grid point represents a case.

Figure 3. Pie charts of gender, age, smoking status, pathology, mutation type and mutant exon of patients in D3EGFRdb.

Figure 4. Evaluation of the deep learning models. (A) Heatmap of the Pearson correlation for 80 models. (B) Heatmap of mean squared error (MSE) for 80 models. (C) Model performance on the biological activity validation set, biological activity test set and representative D3EGFRdb subset.

Figure 5. Predicted probability of each drug response for drug-mutant pairs in the representative D3EGFRdb subset.

Figure 6. Input and output of the D3EGFR server.

Table 2: Clinicopathological characteristics of patients in the external clinical dataset

Table 3: ACR and predicted probability of each drug response on the external clinical dataset